{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b88f590-e457-4052-9dbd-74d7be597dc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# =============================================================\n",
    "# Copyright © 2023 Intel Corporation\n",
    "# \n",
    "# SPDX-License-Identifier: MIT\n",
    "# ============================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f25b97a-56f7-4309-87fa-d9626baecf5e",
   "metadata": {},
   "source": [
    "# Interactive Chat Based on DialoGPT Model Using Intel® Extension for PyTorch* Quantization\n",
    "\n",
    "This code sample shows usage of DiloGPT model as interactive chat with Intel® Extension for PyTorch* INT8 quantization.\n",
    "\n",
    "## DialoGPT\n",
    "\n",
    "DialoGPT is a model based on GPT-2 architecture proposed by Microsoft in 2019. It's goal was to create open-domain chatbots capable of producing natural responses to a variety of conversational topics."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7c87090-2f40-4c29-b70b-c9d413bd3bff",
   "metadata": {},
   "source": [
    "The `Interactive chat based on DialoGPT model using Intel® Extension for PyTorch* Quantization` sample demonstrates how to create interactive chat based on pre-trained DialoGPT model and add the Intel® Extension for PyTorch* quantization to it.\n",
    "\n",
    "| Area                  | Description|\n",
    "|-----------------------|------------|\n",
    "| What you will learn   | How to create interactive chat and add INT8 dynamic quantization form Intel® Extension for PyTorch*|\n",
    "| Time to complete      | 10 minutes|\n",
    "| Category              | Concepts and Functionality|\n",
    "\n",
    "The Intel® Extension for PyTorch* extends PyTorch* with optimizations for extra performance boost on Intel® hardware. While most of the optimizations will be included in future PyTorch* releases, the extension delivers up-to-date features and optimizations for PyTorch on Intel® hardware. For example, newer optimizations include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).\n",
    "\n",
    "## Purpose\n",
    "\n",
    "This sample shows how to create interactive chat based on the pre-trained DialoGPT model from HuggingFace and how to add INT8 dynamic quantization to it. The Intel® Extension for PyTorch* gives users the ability to speed up operations on processors with INT8 data format and specialized computer instructions. The INT8 data format uses quarter the bit width of floating-point-32 (FP32), lowering the amount of memory needed and execution time to process with minimum to zero accuracy loss.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "| Optimized for           | Description|\n",
    "|-------------------------|------------|\n",
    "| OS                      | Ubuntu* 20.04 or newer|\n",
    "| Hardware                | Intel® Xeon® Scalable Processor family|\n",
    "| Software                | Intel® Extension for PyTorch*|"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0174e7dd-58ae-47ea-8f11-fa3d1ee8c317",
   "metadata": {},
   "source": [
    "## Environment Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc24bdae-fcb7-40a5-8bb8-76472598730b",
   "metadata": {},
   "source": [
    "### Install Jupyter notebook by Conda\n",
    "\n",
    "Please refer to the guide in README.md to setup running environment:\n",
    "\n",
    "1. Create Conda running environment.\n",
    "2. Install Jupyter notebook.\n",
    "3. Install Intel® Extension for PyTorch* for CPU packages.\n",
    "4. Startup Jupyter notebook service and open by web browser.\n",
    "\n",
    "\n",
    "#### Set Kernel to PyTorch-CPU\n",
    "\n",
    "In Jupyter notebook menu, change kernel \"PyTorch-CPU\" by Kernel->Change Kernel."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2d9847c-2d72-4dfc-a4a9-b87c987ff363",
   "metadata": {},
   "source": [
    "### Install other python packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9e7f549-65dd-4286-9e04-8d13a766c0e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install transformers matplotlib"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8c1ae1f1-4878-4dc6-bbb3-5a0f17fbbd00",
   "metadata": {},
   "source": [
    "Let's start with importing all necessary packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45146b66-e41e-400e-8a1b-5e680bbb7575",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import torch\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e158dd4-e2a7-44ca-af2d-052f88247e97",
   "metadata": {},
   "source": [
    "## Model and tokenizer loading\n",
    "\n",
    "The first implemented function is loading tokenizer and model. \n",
    "\n",
    "Function input is link to the pre-trained model. In this sample we are using `microsoft/DialoGPT-large` from HuggingFace. This is also default parameter for this function. Of course, you can use also `microsoft/DialoGPT-medium` or `microsoft/DialoGPT-samll` models. Especially if you have limited resources. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6142753-eab1-4167-9818-4b40c900473c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def load_tokenizer_and_model(model=\"microsoft/DialoGPT-large\"):\n",
    "    \"\"\"\n",
    "    Load tokenizer and model instance for some specific DialoGPT model.\n",
    "    \"\"\"\n",
    "    # Initialize tokenizer and model\n",
    "    print(\"Loading model...\")\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model, padding_side='left')\n",
    "    model = AutoModelForCausalLM.from_pretrained(model)\n",
    "    \n",
    "    # Return tokenizer and model\n",
    "    return tokenizer, model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4e150e6-4976-4998-93be-5f5f9ddcbb5b",
   "metadata": {
    "tags": []
   },
   "source": [
    "## INT8 Dynamic Quantization\n",
    "\n",
    "**Quantization** is a systematic reduction of the precision of all or several layers within the model. This means that we turn a higher-precision type, such as the FP32 (32 bits) most commonly used in Deep Learning, into a lower-precision type, such as FP16 (16 bits) or INT8 (8 bits). \n",
    "\n",
    "With type reduction, it is possible to effectively reduce the size of the model and also faster inference. That means:\n",
    "\n",
    "* lower memory bandwidth, \n",
    "* lower storage, \n",
    "* higher performance with minimum to zero accuracy loss. \n",
    "\n",
    "This is especially important, with large models such as those based on the Transformers architecture, like BERT or used in this sample GPT. \n",
    "\n",
    "We can distinguish 2 types of quantization:\n",
    "\n",
    "* static - requires an additional pass over a dataset to work, only activations do calibration,\n",
    "* dynamic - multiplies input values by the scale factor, then rounds the result to the nearest, the scale factor for activations is determined dynamically based on the data range observed in runtime.\n",
    "\n",
    "In this sample we are using **the dynamic quantization**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cca006fa-6fce-4e5f-81c0-240d12757493",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from intel_extension_for_pytorch.quantization import prepare, convert\n",
    "import intel_extension_for_pytorch as ipex\n",
    "\n",
    "def quantize_model(tokenizer, model):\n",
    "    \"\"\"\n",
    "    Adding Intel® Extension for PyTorch* dynamic quantization to the model\n",
    "    \"\"\"\n",
    "    # Evaluate model\n",
    "    model.eval()\n",
    "    \n",
    "    print(\"Quantization in progress...\")\n",
    "    \n",
    "    # Prepare example outputs for the model\n",
    "    question, text = \"What is SYCL?\", \"SYCL is an industry-driven standard, developed by Kronos Group and announced in March 2014.\"\n",
    "    inputs = tokenizer(question, text, return_tensors=\"pt\")\n",
    "    jit_inputs  = tuple((inputs['input_ids']))\n",
    "    \n",
    "    # Create configuration for dynamic quantization\n",
    "    qconfig = ipex.quantization.default_dynamic_qconfig\n",
    "    \n",
    "    # Optimize model\n",
    "    model = ipex.optimize(model)\n",
    "    \n",
    "    # Prepare model for quantization using previously prepared parameters\n",
    "    prepared_model = prepare(model, qconfig, example_inputs=jit_inputs, inplace=False)\n",
    "    \n",
    "    # Convert types in model\n",
    "    converted_model = convert(prepared_model)\n",
    "    \n",
    "    return tokenizer, converted_model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0efd690e-96bd-49aa-8a6b-863d4de3cdfa",
   "metadata": {},
   "source": [
    "## Response generation \n",
    "\n",
    "Response generation in DialoGPT architecture based on **encoder-decoder** model. It means that first we need to *encode input sentence*, to later on be able to *decode it* generating response.\n",
    "\n",
    "As the model based on transformers architecture they have known issue of copying things. To avoid repetition in chat responses we used Top-K sampling and Top-p sampling.\n",
    "\n",
    "**Top-K sampling** filters the K most likely next words and redistributes the probability mass among only those K next words. **Top-p sampling**, rather than selecting only the most likely K words, selects the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among the words in this set. As a result, the size of the set of words can be dynamically increased and decreased based on the probability distribution of the next word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d90bd2c3-ff9c-4e52-994d-341792e3e035",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def generate_response(tokenizer, model, chat_round, chat_history_ids):\n",
    "    \"\"\"\n",
    "    Generate a response to some user input.\n",
    "    \"\"\"\n",
    "    # Encode user input and End-of-String (EOS) token\n",
    "    new_input_ids = tokenizer.encode(input(\">> You:\") + tokenizer.eos_token, return_tensors='pt')\n",
    "    \n",
    "    # Append tokens to chat history\n",
    "    bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_round > 0 else new_input_ids\n",
    "    \n",
    "    # Generate response given maximum chat length history of 2000 tokens\n",
    "    chat_history_ids = model.generate(\n",
    "        bot_input_ids,\n",
    "        do_sample=True, \n",
    "        max_length=2000,\n",
    "        top_k=50, \n",
    "        top_p=0.95,\n",
    "        pad_token_id=tokenizer.eos_token_id\n",
    "    )\n",
    "    \n",
    "    # Print response\n",
    "    print(\"DialoGPT: {}\".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))\n",
    "    \n",
    "    # Return the chat history ids\n",
    "    return chat_history_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db1b079b-476c-47da-8a6a-3d42fccc32d4",
   "metadata": {},
   "source": [
    "The next step is to prepare a function that allows interactive conversation for `n` rounds. This means that we will use the previously prepared `generate_response` function n-times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28968553-b051-442d-abc2-92d8ac34415a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def chat_for_n_rounds(tokenizer, model, n=5):\n",
    "    \"\"\"\n",
    "    Chat with chatbot for n rounds (n = 5 by default)\n",
    "    \"\"\"\n",
    "\n",
    "    # Initialize history variable\n",
    "    chat_history_ids = None\n",
    "\n",
    "    # Chat for n rounds\n",
    "    for chat_round in range(n):\n",
    "        chat_history_ids = generate_response(tokenizer, model, chat_round, chat_history_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41b0f86f-2b17-41cd-911e-9ac9a92be4a0",
   "metadata": {},
   "source": [
    "Now, it is time to use implemented functions - initializing the model and adding INT8 dynamic quantization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1749c7a-4bba-4731-bbc6-da560edcfed2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize tokenizer and model\n",
    "tokenizer, model = load_tokenizer_and_model()\n",
    "\n",
    "# Adding ipex quantization to the model\n",
    "tokenizer, model = quantize_model(tokenizer, model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31bae96c-276c-463f-8085-2cd8e97b5f30",
   "metadata": {},
   "source": [
    "Let's play with the model by 5 rounds. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d79fa9d7-5713-4ceb-b489-90f1a4f6a4cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "chat_for_n_rounds(tokenizer, model, 5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "16402940-0779-44a1-98b5-3f23c5784bd4",
   "metadata": {},
   "source": [
    "## Performance comparison\n",
    "\n",
    "Now that we know that the DialoGPT model still performs well as a chat bot after quantization, let's compare the model's performance before and after applying INT8 dynamic quantization.\n",
    "\n",
    "Let's start with defining function that will measure time that model needs for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c8b4677-4635-4abd-8eca-9ed43b9b6624",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from time import time\n",
    "def test_inference(model, data, warmup=5 , iters=25):\n",
    "    print(\"Warmup...\")\n",
    "    for i in range(warmup):\n",
    "        out = model(data)\n",
    "\n",
    "    print(\"Inference...\")\n",
    "    inference_time = 0\n",
    "    for i in range(iters):\n",
    "        start_time = time()\n",
    "        out = model(data)\n",
    "        end_time = time()\n",
    "        inference_time = inference_time + (end_time - start_time)\n",
    "\n",
    "    inference_time = inference_time / iters\n",
    "    return inference_time"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a41f2a40-2176-4f04-b277-e3622df90430",
   "metadata": {},
   "source": [
    "First, let's measure average time of inference for original model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cb034fe-9b8b-4ee9-9975-8a6a03ce79a4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\"Inference with FP32\")\n",
    "tokenizer_fp32, model_fp32 = load_tokenizer_and_model()\n",
    "data = torch.randint(model_fp32.config.vocab_size, size=[1, 512])\n",
    "fp32_inference_time = test_inference(model_fp32, data = data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6c58546c-5c60-482c-9782-ac901855ddce",
   "metadata": {
    "tags": []
   },
   "source": [
    "Then, the average inference time of model after INT8 dynamic quantization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05fcd18c-0674-4715-a606-ce5ce9e42560",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\"Inference with Dynamic INT8\")\n",
    "tokenizer_int8, model_int8 = load_tokenizer_and_model()\n",
    "tokenizer_int8, model_int8 = quantize_model(tokenizer_int8, model_int8)\n",
    "data = torch.randint(model_int8.config.vocab_size, size=[1, 512])\n",
    "int8_inference_time = test_inference(model_int8, data = data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ef0648b-c926-42ac-8367-e1a3edb067ea",
   "metadata": {},
   "source": [
    "Now, it's time to show nup the results on the bar chart using `matplotlib` library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d492e5d-e188-489b-a18d-aa32cca0a1b8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Create bar chart with training time results\n",
    "plt.figure(figsize=(4,3))\n",
    "plt.title(\"DialoGPT Inference Time\")\n",
    "plt.ylabel(\"Inference Time (seconds)\")\n",
    "plt.bar([\"FP32\", \"INT8 dynamic\"], [fp32_inference_time, int8_inference_time])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2c31e73-2d6d-4323-9609-f04191f8863d",
   "metadata": {},
   "source": [
    "DialoGPT by Microsoft is another conversational chatbot that everyone can use. \n",
    "\n",
    "Based on this architecture, we created an interactive chat in this sample. The use of top-k and top-p allowed us to avoid some of the repetition in the chat answers. Furthermore, the addition of dynamic INT8 quantization reduced memory usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b56ff32-34d0-4866-9050-df1bdf7ad736",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\"[CODE_SAMPLE_COMPLETED_SUCCESFULLY]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89043271-f3dc-4d4d-a630-40c570c53d98",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
